Flash Attention AI News List | Blockchain.News

List of AI News about Flash Attention

2026-02-03 21:49
Latest Analysis: FP8 Training Enables 4.3% Speedup for GPT-2 Model on H100 GPUs, Cost Drops to $20

According to Andrej Karpathy on Twitter, enabling FP8 precision training for GPT-2 on H100 GPUs has yielded a 4.3% improvement in training time, reducing it to 2.91 hours. Karpathy highlights that with 8xH100 spot instance pricing, the total cost to reproduce the GPT-2 model now stands at approximately $20, a dramatic reduction compared to OpenAI's original $43,000 GPT-2 training seven years ago. As reported by Karpathy, further optimizations such as Flash Attention 3 kernels, the Muon optimizer, and advanced attention patterns have contributed to these gains. While FP8 offers theoretical FLOPS advantages, Karpathy notes practical challenges, including overhead from scale conversions and limited support, especially at GPT-2 model scale. Nonetheless, the industry shift to FP8 points to broader opportunities for cost-effective LLM training, as evidenced by torchao's reported 25% speedup on larger models such as Llama3-8B. According to Karpathy, continued improvements in FP8 support and model training strategies can reduce both the time and financial barriers to LLM development, opening further business and research opportunities.
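For context on what "enabling FP8" typically involves in the PyTorch ecosystem, the sketch below uses torchao's documented float8 training entry point to swap a model's linear layers for FP8 variants. It is an illustrative example only, not Karpathy's training code: the toy model, shapes, and hyperparameters are assumptions, and exact API details may differ across torchao versions.

```python
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training  # requires torchao

# Toy model for illustration; FP8 linear kernels expect H100-class hardware
# and matmul dimensions that are multiples of 16.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).cuda().to(torch.bfloat16)

# Replace nn.Linear modules with float8 training variants: the matmuls run
# in FP8 with dynamic scaling, while master weights and optimizer state
# stay in higher precision.
convert_to_float8_training(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()  # dummy loss, just to exercise backward
loss.backward()
optimizer.step()
```

The scale computations and dtype casts wrapped around each matmul are the "scale conversion" overhead Karpathy refers to, which is why the measured speedup at GPT-2 scale is smaller than the raw FP8 FLOPS advantage would suggest.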

Source
2026-02-03 21:49
Latest Analysis: FP8 Training Reduces GPT-2 Training Time to 2.91 Hours with H100 GPUs

According to Andrej Karpathy on Twitter, enabling FP8 training has improved 'time to GPT-2' by 4.3%, reducing the training duration to 2.91 hours on an 8xH100 setup. Karpathy notes that, at spot instance pricing, the cost to reproduce GPT-2 training is now approximately $20. This marks a significant shift from GPT-2's original characterization as 'too dangerous to release' in 2019 to being as accessible as MNIST today. The FP8 implementation presented practical challenges, with limited support and real-world performance falling short of the theoretical FLOPS gains. For tensorwise scaling, a speedup of about 7.3% was achieved, though Karpathy highlights that further optimizations could lower the time and cost even more. By comparison, torchao reported a 25% speedup for FP8 training of Llama3-8B. Karpathy also underscores that, thanks to advances such as Flash Attention 3 and the Muon optimizer, the cost of training GPT-2 has fallen by a factor of nearly 600 over the past seven years, offering substantial business opportunities for AI startups and researchers seeking low-cost, rapid model prototyping. As reported by Karpathy, ongoing optimizations in projects like nanochat continue to drive down training costs and times, making advanced language model training accessible to a wider audience.
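The "tensorwise scaling" mentioned above means using a single scale factor per tensor when casting to FP8. A minimal, self-contained sketch of that round trip is below; the constant and helper names are ours, and it only illustrates the bookkeeping FP8 adds around each matmul, not the fused kernels used in practice.

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_tensorwise(x: torch.Tensor):
    # One scale for the whole tensor, chosen so its max magnitude maps to E4M3_MAX.
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Undo the scaling after the low-precision computation.
    return x_fp8.to(torch.float32) * scale

x = torch.randn(1024, 1024)
x_fp8, scale = quantize_tensorwise(x)
x_hat = dequantize(x_fp8, scale)
print("max abs error:", (x - x_hat).abs().max().item())
```

In a real training step, such scales have to be produced and carried along for activations, weights, and gradients on every matmul, which is part of why the end-to-end gains reported here stay well below the raw FP8 FLOPS advantage.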

Source
2026-01-31 20:55
Latest Analysis: nanochat Achieves GPT-2 Grade LLM Training for Under $100 Using Single 8XH100 Node

According to Andrej Karpathy on Twitter, nanochat can now train large language models (LLMs) with GPT-2 level capabilities for less than $100, specifically around $73 in just over 3 hours on a single 8XH100 node. This represents a dramatic reduction in both time and cost compared to the original GPT-2 training by OpenAI in 2019, which required 32 TPU v3 chips running for seven days at a total cost of approximately $43,000. The advancement leverages optimizations such as Flash Attention 3 kernels, the Muon optimizer, and improved residual pathways. As reported by Karpathy, these developments not only make LLM prototyping significantly more accessible but also demonstrate a continued trend of rapidly decreasing training costs, opening new business opportunities for startups and researchers in the AI field.
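As a rough illustration of the attention side of these optimizations, the sketch below routes attention through PyTorch's fused scaled_dot_product_attention with the flash backend requested. nanochat's reported gains come from dedicated Flash Attention 3 kernels on H100, so this is only a stand-in; the shapes and dtypes here are arbitrary GPT-2-like values chosen for the example.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch 2.3+

# GPT-2-like attention shapes, picked for illustration only.
batch, heads, seq, head_dim = 4, 12, 1024, 64
q, k, v = (
    torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.bfloat16)
    for _ in range(3)
)

# Ask PyTorch to dispatch this region to its flash-attention backend;
# causal masking is handled inside the fused kernel.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([4, 12, 1024, 64])
```

Fused attention kernels avoid materializing the full seq-by-seq attention matrix, which is where most of the memory and wall-clock savings over a naive implementation come from.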

Source